Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification

Authors

  • Chu-Ren Huang
  • Petr Simon
  • Shu-Kai Hsieh
  • Laurent Prévot
Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in human language technology (HLT) is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB's) into either word-boundaries (WB's) or non-word-boundaries. In Chinese, CB's are delimited and distributed in between two characters. Hence we can use the distributional properties of CB's among the background character strings to predict which CB's are WB's.
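The abstract frames segmentation as a binary classification of each gap between adjacent characters. As a rough illustration of that framing only, and not the authors' actual model, the sketch below labels each CB using a single hypothetical distributional feature, the pointwise mutual information (PMI) between the two characters surrounding the boundary, with an arbitrary threshold; the names train_counts, classify_boundaries, and segment are invented for this example.

    from collections import Counter
    import math

    def train_counts(corpus):
        """Collect character unigram and bigram counts from raw, unsegmented text."""
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            unigrams.update(sent)
            bigrams.update(zip(sent, sent[1:]))
        return unigrams, bigrams

    def classify_boundaries(sent, unigrams, bigrams, threshold=0.0):
        """Label each character boundary (CB) as a word boundary (WB) or not.

        Feature (an assumption for this sketch): PMI between the characters on
        either side of the CB; a loosely associated pair suggests a word break.
        """
        uni_total = sum(unigrams.values())
        bi_total = sum(bigrams.values()) or 1
        labels = []
        for left, right in zip(sent, sent[1:]):
            p_l = unigrams[left] / uni_total
            p_r = unigrams[right] / uni_total
            p_lr = bigrams[(left, right)] / bi_total
            pmi = math.log(p_lr / (p_l * p_r)) if p_lr > 0 else float("-inf")
            labels.append("WB" if pmi < threshold else "non-WB")
        return labels

    def segment(sent, labels):
        """Insert spaces at the boundaries classified as WBs."""
        if not sent:
            return sent
        out = [sent[0]]
        for ch, lab in zip(sent[1:], labels):
            if lab == "WB":
                out.append(" ")
            out.append(ch)
        return "".join(out)

In practice the fixed PMI threshold would be replaced by a trained classifier over richer contextual features of the surrounding character strings, which is closer to what boundary-classification approaches actually do.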


Similar articles

Word Segmentation for Urdu OCR System

This paper presents a technique for word segmentation for the Urdu OCR system. Word segmentation, or word tokenization, is a preliminary task for understanding the meanings of sentences in Urdu language processing. Several techniques are available for word segmentation in other languages, but little work has been done on word segmentation for the Urdu Optical Character Recognition (OCR) system. A me...


Chunking-based Chinese Word Tokenization

This paper introduces a Chinese word tokenization system through HMM-based chunking. Experiments show that such a system can deal well with the unknown-word problem in Chinese word tokenization. The second term in (2-1) is the mutual information between T and ...; in order to simplify the computation of this term, we assume mutual information independence (2-2): ...
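The equations referenced in this snippet were garbled by text extraction and are truncated. For orientation only: in HMM-based chunk taggers of this kind (e.g., Zhou and Su's work), the mutual information independence assumption alluded to as (2-2) is commonly written as below. This is a reconstruction from that line of work, not a quote from the paper itself.

    % Mutual information independence: the MI between the tag sequence T
    % and the input W decomposes over the individual tags t_i.
    MI(T, W) = \sum_{i=1}^{n} MI(t_i, W)
    % Under this assumption the conditional log-probability decomposes as:
    \log P(T \mid W) = \log P(T) - \sum_{i=1}^{n} \log P(t_i)
                     + \sum_{i=1}^{n} \log P(t_i \mid W)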


Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese-Japanese

Unlike European languages, many Asian languages such as Chinese and Japanese have no typographic word boundaries in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is normally treated as the first step for machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead to different segmentation results in differe...


Chinese text word-segmentation considering semantic links among sentences

Tokenization of Chinese input text into words is a necessary step in building a Mandarin Chinese text-to-speech system. Several word-segmentation algorithms have been developed in which linguistic information is combined with statistical information or heuristic rules. In this paper we investigate the advantages that can arise when semantic relations among sentences are taken into account during the word se...


Chinese Sentence Tokenization Using a Word Classifier

In this paper, we explore a Chinese sentence tokenizer built using a word classifier. In contrast to state-of-the-art conditional random field approaches, this one is simple to implement and easy to train. The work is broken down into two pieces: the sentence maximizer makes guesses over a large number of sentence tokenization candidates and scores each one. The highest-scored sentence toke...
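The "sentence maximizer" described here enumerates tokenization candidates and keeps the best-scoring one. A minimal sketch of that candidate-scoring loop follows; candidates, best_segmentation, and the word_score callable are hypothetical names, with word_score standing in for the paper's word classifier.

    def candidates(sent, max_word_len=4):
        """Enumerate every segmentation of sent into words of bounded length."""
        if not sent:
            yield []
            return
        for k in range(1, min(max_word_len, len(sent)) + 1):
            for rest in candidates(sent[k:], max_word_len):
                yield [sent[:k]] + rest

    def best_segmentation(sent, word_score):
        """Score each candidate with a word-level classifier and keep the best."""
        return max(candidates(sent),
                   key=lambda words: sum(word_score(w) for w in words))

Exhaustive enumeration grows exponentially with sentence length; a practical system would fold the same word-level scoring into dynamic programming or beam search rather than materializing every candidate.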



Journal title:

Volume   Issue

Pages  -

Publication year: 2007